-
Stage 1: linear projection of each flattened non-overlapping patch to its token/feature; these tokens, with position embeddings added, are fed to the first two successive Swin Transformer blocks
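A minimal numpy sketch of this patch-embedding step, assuming the Swin-T defaults (\( 4\times 4 \) patches, \( C=96 \)); the weight matrix here is random, standing in for the learned projection:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes following the Swin-T defaults: 4x4 patches, C = 96.
H, W, C_in, P, C = 224, 224, 3, 4, 96

image = rng.standard_normal((H, W, C_in))

# Partition the image into non-overlapping PxP patches and flatten each one.
patches = image.reshape(H // P, P, W // P, P, C_in)                   # (56, 4, 56, 4, 3)
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * C_in)  # (3136, 48)

# Linear projection of each flattened patch to a C-dimensional token.
W_proj = rng.standard_normal((P * P * C_in, C)) * 0.02  # stand-in for learned weights
tokens = patches @ W_proj                               # (3136, 96)
```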
-
Stage 2&3&4: the patch merging layer first concatenates the features of each group of \( 2 \times 2 \) neighboring patches, then applies a linear layer to reduce the concatenated \( 4C \)-dimensional features to \( 2C \) (halving the resolution while doubling the channels), and finally feeds the processed features to two consecutive Swin Transformer blocks
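The patch merging step above can be sketched in numpy as follows (hypothetical stage-2 input of \( 56 \times 56 \times 96 \); the reduction matrix is random, standing in for the learned linear layer):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stage-2 input: a 56x56 feature map with C = 96 channels.
H, W, C = 56, 56, 96
x = rng.standard_normal((H, W, C))

# Concatenate the features of each 2x2 group of neighboring patches: C -> 4C.
x0 = x[0::2, 0::2, :]   # top-left patch of each 2x2 group
x1 = x[1::2, 0::2, :]   # bottom-left
x2 = x[0::2, 1::2, :]   # top-right
x3 = x[1::2, 1::2, :]   # bottom-right
merged = np.concatenate([x0, x1, x2, x3], axis=-1)   # (28, 28, 384)

# Linear layer reduces the concatenated 4C channels to 2C.
W_reduce = rng.standard_normal((4 * C, 2 * C)) * 0.02  # stand-in for learned weights
out = merged @ W_reduce                                # (28, 28, 192)
```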
-
hierarchical feature maps by merging image patches in deeper layers
-
linear computational complexity in input image size, since self-attention is computed only within each local window
-
window multi-head self-attention (W-MSA)
-
W-MSA gives linear computational complexity in image size
\begin{align}
&\Omega(MSA)=4hwC^2 + 2(hw)^2C \\
&\Omega(W\mbox{-}MSA)=4hwC^2 + 2M^2hwC
\end{align}
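Plugging concrete numbers into the two formulas above (a hypothetical \( 56 \times 56 \) stage-1 feature map with \( C=96 \) and window size \( M=7 \)) shows the gap: the quadratic term dominates MSA, while W-MSA stays roughly an order of magnitude cheaper:

```python
# Evaluate the two complexity formulas above for a hypothetical
# stage-1 feature map (h = w = 56, C = 96) with window size M = 7.
h, w, C, M = 56, 56, 96, 7

msa = 4 * h * w * C**2 + 2 * (h * w) ** 2 * C    # quadratic in h*w
w_msa = 4 * h * w * C**2 + 2 * M**2 * h * w * C  # linear in h*w

print(msa, w_msa, msa / w_msa)  # W-MSA is ~14x cheaper at this resolution
```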
-
shifted window multi-head self-attention (SW-MSA)
-
SW-MSA augments W-MSA by introducing connections between neighboring non-overlapping windows of the previous layer
\begin{align}
&\hat{z}^l=W\mbox{-}MSA(LN(z^{l-1}))+z^{l-1} \\
&z^l=MLP(LN(\hat{z}^l))+\hat{z}^l \\
&\hat{z}^{l+1}=SW\mbox{-}MSA(LN(z^l))+z^l \\
&z^{l+1}=MLP(LN(\hat{z}^{l+1}))+\hat{z}^{l+1}
\end{align}
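The four equations above map directly onto a pre-norm residual structure. A minimal sketch, with the sub-layers passed in as caller-supplied stand-ins (the real W-MSA/SW-MSA/MLP each have their own learned weights):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN over the channel dimension; learned scale/shift omitted for brevity.
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def swin_block_pair(z, w_msa, sw_msa, mlp):
    """Two consecutive blocks mirroring the four equations above.
    w_msa, sw_msa, mlp are hypothetical stand-ins for the real sub-layers."""
    z_hat = w_msa(layer_norm(z)) + z     # \hat{z}^l
    z = mlp(layer_norm(z_hat)) + z_hat   # z^l
    z_hat = sw_msa(layer_norm(z)) + z    # \hat{z}^{l+1}
    z = mlp(layer_norm(z_hat)) + z_hat   # z^{l+1}
    return z

# Usage with identity stand-ins, just to check that token shape is preserved:
tokens = np.zeros((3136, 96))
out = swin_block_pair(tokens, w_msa=lambda x: x, sw_msa=lambda x: x, mlp=lambda x: x)
```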
-
cyclic-shifting toward the top-left direction
-
a batched window may be composed of several sub-windows that are not adjacent in the feature map, so a masking mechanism is employed to limit self-attention computation to within each sub-window
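A small numpy sketch of the cyclic shift and the region ids used for masking, mirroring the reference implementation's mask construction (toy \( 8 \times 8 \) map, window size \( M=4 \), shift \( s=M/2=2 \)):

```python
import numpy as np

# Toy 8x8 feature map with window size M = 4 and shift s = M // 2 = 2.
H = W = 8
M, s = 4, 2
x = np.arange(H * W).reshape(H, W)

# Cyclic shift toward the top-left: roll rows and columns up/left by s.
shifted = np.roll(x, shift=(-s, -s), axis=(0, 1))

# Assign each position a region id; positions with different ids share a
# batched window only because of the roll, so attention between them is masked.
region = np.zeros((H, W), dtype=int)
slices = (slice(0, -M), slice(-M, -s), slice(-s, None))
cnt = 0
for hs in slices:
    for ws in slices:
        region[hs, ws] = cnt
        cnt += 1
# Within each batched window, token pairs with unequal region ids get a large
# negative value added to their attention logits (the masking mechanism).
```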
-
relative position bias
\[
Attention(Q,K,V)=SoftMax(QK^T/\sqrt{d}+B)V
\]
where \( Q, K, V\in \mathbb{R}^{M^2\times d} \) are the query, key and value matrices; \( d \) is the query/key dimension, and \( M^2 \) is the number of patches in a window; since relative positions along each axis lie in \( [-M+1, M-1] \), the bias \( B \) is looked up from a smaller parameterized matrix \( \hat{B}\in \mathbb{R}^{(2M-1)\times (2M-1)} \)
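A single-head numpy sketch of this attention with relative position bias; \( Q, K, V \) and the \( (2M-1)\times(2M-1) \) bias table are random stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical single-head window attention: M = 7, so M^2 = 49 tokens, d = 32.
M, d = 7, 32
Q = rng.standard_normal((M * M, d))
K = rng.standard_normal((M * M, d))
V = rng.standard_normal((M * M, d))

# B is looked up from a (2M-1)x(2M-1) table using the relative coordinates of
# each token pair; a random table stands in for the learned one here.
table = rng.standard_normal((2 * M - 1, 2 * M - 1))
coords = np.stack(np.meshgrid(np.arange(M), np.arange(M), indexing="ij"),
                  axis=-1).reshape(-1, 2)                 # (49, 2) patch coords
rel = coords[:, None, :] - coords[None, :, :] + (M - 1)   # shift into [0, 2M-2]
B = table[rel[..., 0], rel[..., 1]]                       # (49, 49) bias matrix

attn = softmax(Q @ K.T / np.sqrt(d) + B)  # attention weights, rows sum to 1
out = attn @ V                            # (49, 32)
```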